5. Main Analysis
5.1 Individual restaurant information locator
Say someone wants to choose a restaurant to eat and cares about the hygiene situation of the restaurant.
In this page, user can input the name of restaurant and check the inspection results. For example, user wants to eat subway around Columbia. Input the subway and location, it shows three subway restaurants on the map.
Check these three subway, the one on 348 MALCOLM X BOULEVARD has food temperature, hygiene and container problems and the inspection score is very high which means it sanitation is not very good.
And the one on 578 W 125th ST, it only has food container problem and the score is very low on 2017. Therefore, according to the analysis result. The user will prefer this one.
If user does not know which specific restaurant to eat and input the cuisine type and location, It will present all restaurants on the map and user can click each of them to see the analysis of sanitation condition and then choose which one to eat. Generally, we show inspection history result and compare this restaurant with other same cuisine type restaurants in this location. User can choose the restaurant based on these information.
For example, when user wants to eat African food near Columbia, input the information and it shows five restaurants.
For clicks ACCORA.RESTAURANT, the line graph shows this restaurant inspection score history. The lower score is better. Its hygiene is good in 2015 but not very good recently. And during the inspection, this restaurant has food temperature, facility and hygiene issues. In bar chart, compared with other near African restaurants inspection results, ACCORA RESTAURANT has more serious issue on food temperature.
For clicks J.RESTAURANT, the line graph shows this restaurant inspection score history. The lower score is better. For J. RESTURANT, its hygiene is good in 2015. And during the inspection, this restaurant has food temperature, food source and hygiene issues. In bar chart, compared with other near African restaurants inspection results, J.RESTAURANT has more serious issues on food resource and temperature.
We use leaflet package to show the map. And use popup to show the restaurant details including location, zipcode, cuisine type and phone number. When user clicks one restaurant, it will show the analysis of inspection result. The difficult part is to implement the interaction between ui.R and server.R. When user choose one restaurant on the map, it will return the id of this restaurant and process data by this id in the server.R.
5.2 Inspection performance among different neighborhoods
To get a sense of the overview hygiene condition of all restaurants in New York City, we firstly think of a map to display where the restaurants’ hygiene condition is better and where is worse. Thus, we use a filled map to display the mean/median inspection scores for all neighborhoods by colors.
One challenge we met was to find neighborhood boundaries in New York City. We found out that the neighborhood boundaries were not well defined. For example, Google maps and the neighborhood boundaries on the real estate website Streeteasy gives different boundary for Harlem. Also, neighborhood area may be too large to get detail information of smaller areas. Then we decided to use zip codes as to divide areas. Zip codes boundaries were well defined and their size is reasonable to be an area to study.
Secondly, since we have data of multiple years, we would like to see whether the hygiene condition improved over the years and where they are. So we add the parameter year to enable us to see the yearly change. From the data, we have complete and decent number of inspections after year 2013, thus we didn’t use the data of year 2012.
In terms of the yearly change, the performance of restaurants on food temperature was quite stable over the years. Overall, the restaurants in NYC performed well on reasonable food temperature. However, when looking at the quality of food source, we can see that the map shows more dark areas in year 2017 than previous years, which means that the quality of food source in the first quarter of year 2017 got worse compared to previous years.
Also, different people may weigh the violation type differently. For example, I think the problem of food source is more serious than whether the food is too hot to be served. But one of our group member has different thoughts. So we add another parameter, violation type, to enable us to see the performance of restaurants under different type of inspections. Another interesting finding from this is to see which kind of sanitation issue is more serious for most restaurants.
When looking at the different violation codes, we found that there are 75 different kinds of violations. With closer look, we found these violations can be reclassified into smaller number of types, which is easier to understand. For example, all violations with ‘02’ in violation code are related to the problem with food source. So we made them into the class ‘Problem with food source’.
library(dplyr)
data_score <- data[which(!is.na(data$SCORE)),]
data_score <- data_score[which(!is.na(data_score$VIOLATION.CODE)),]
data_score$VIOLATION.TYPE <- sapply(data_score$VIOLATION.CODE, substring, 1, 2)
type <- data_score %>% select(VIOLATION.TYPE, VIOLATION.CODE,VIOLATION.DESCRIPTION)%>%
arrange(VIOLATION.TYPE)
type <- unique(type)
type[1:5,]
## VIOLATION.TYPE VIOLATION.CODE
## 1 02 02B
## 3 02 02E
## 4 02 02A
## 5 02 02G
## 6 02 02H
## VIOLATION.DESCRIPTION
## 1 Hot food item not held at or above 140\x92\xe4\xce_ F.
## 3 Whole frozen poultry or poultry breasts, other than a single portion, is being cooked frozen or partially thawed.
## 4 Food not cooked to required minimum temperature.
## 5 Cold food item held above 41\x92\xe4\xce_ F (smoked fish and reduced oxygen packaged foods above 38 \x92\xe4\xce_F) except during necessary preparation.
## 6 Food not cooled by an approved method whereby the internal product temperature is reduced from 140\x92\xe4\xce_ F to 70\x92\xe4\xce_ F or less within 2 hours, and from 70\x92\xe4\xce_ F to 41\x92\xe4\xce_ F or less within 4 additional hours.
Here, we find it interesting that among all different violation types, the variance of hygiene conditions about food source among all neighborhoods is the largest. For other violation type, the scores are relatively lower, which means the violation issues are not that serious. So when we are concerned about hygiene condition, we should pay attention to the quality of the food source.
5.3 Other analysis
The third panel is the analysis which is not related to map and location. We use traditional bar plot as the tool and we are focusing on restaurant type and violation type.
For the restaurant type we first find and order it by the number of inspection. This plot can tells us the distribution of restaurant in our data set. Note that the American and Chinese restaurants is the majority of our data set.
library(ggplot2)
ggplot(data[which(!is.na(data$CUISINE.DESCRIPTION)),], aes(x = factor(CUISINE.DESCRIPTION, levels = names(sort(table(CUISINE.DESCRIPTION), decreasing = TRUE)))))+ geom_bar(stat='count')+labs(x="Restaurant Type",y="number of inspection",title="Restaurant Type Distribution")+ theme(axis.text.x = element_text(angle = 45, hjust = 1))
The second thing and the most important thing of the restaurant type we are trying to find is the mean score of each restaurant type. From the plot below, we can see that Creole restaurants has the highest mean score of 22.63 which equals to B grade and the lowest mean score is 8.130435 which is the ‘Soups’ restaurant. Note that the lower the score is the cleaner enviroment the restaurant can provide.
ggplot(aggregate(SCORE~CUISINE.DESCRIPTION, data[which(!is.na(data$SCORE)),], mean),aes(x=factor(CUISINE.DESCRIPTION, levels = CUISINE.DESCRIPTION[order(SCORE)]),y=SCORE))+geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 45, hjust = 1))+labs(x="Restaurant Type",y="Mean of Score",title="Mean Score of Restaurant Type")
The third thing we are looking for is what is the most common critical violation of each restaurant type. From the graph we can see that the 10F is the most common critical violation in most of the restaurant. Note that violation code 10F is “Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit”. We can tell that this is the common health problem in the restaurants of NYC.
most_vio_rt<-data[!is.na(data$VIOLATION.CODE),][data[, "CRITICAL.FLAG"] == "Critical",] %>% count(CUISINE.DESCRIPTION, VIOLATION.CODE) %>%slice(which.max(n))
ggplot(most_vio_rt, aes(x = factor(CUISINE.DESCRIPTION, levels = CUISINE.DESCRIPTION[order(-n)]),y=n))+ geom_bar(stat='identity') +labs(x="Restaurant Type",y="Number of Violation Type",title="The Most Common Critical Violation in Each Restaurant Type")+geom_text(aes(label=VIOLATION.CODE),size=1.5,vjust=-0.5)+ theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 1 rows containing missing values (geom_text).
Then we do the similar analysis on the violation type. We first find and order it by the number of inspection. This plot can tells us the distribution of violation type in our data set. As expected, 10F is the most violated problem. And the violated problems is most the top 9 problems.
ggplot(data[which(!is.na(data$VIOLATION.CODE)),], aes(x = factor(VIOLATION.CODE, levels = names(sort(table(VIOLATION.CODE), decreasing = TRUE)))))+ geom_bar(stat='count')+labs(x="Violation Type",y="number of inspection",title="Violation Type Distribution")+ theme(axis.text.x = element_text(angle = 90, hjust = 1))
The second thing of the violation type we are trying to find is the mean score of each violation type. This represents the harmfulness of each violation type. We can see that the most harmless violation type is 20F (Current letter grade card not posted) with score 7 and the most critical violation type is 06H (Records and logs not maintained to demonstrate that HACCP plan has been properly implemented) with score 60.56. Note that the most common violation type 10F has score 13.42 which is still in A range.
ggplot(aggregate(SCORE~VIOLATION.CODE, data[which(!is.na(data$SCORE)),], mean),aes(x=factor(VIOLATION.CODE, levels = VIOLATION.CODE[order(SCORE)]),y=SCORE))+geom_bar(stat="identity")+ theme(axis.text.x = element_text(angle = 45, hjust = 1))+labs(x="Violation Type",y="Mean of Score",title="Mean Score of Violation Type")
The last plot of the third panel is the distribution of inspection over time. Our data is from 2012-2017 and the majority of the data is form 2013-2017.
data$INSPECTION.DATE <- as.Date(data$INSPECTION.DATE, format="%m/%d/%y")
data <- data[which(data$INSPECTION.DATE != '2000-01-01'),]
data <- data[data[, "lat"] != 0,]
data <- data[data[, "ing"] != 0,]
data$YEAR <- as.numeric(format(data$INSPECTION.DATE, "%Y"))
ggplot(data[which(!is.na(data$INSPECTION.DATE)),],aes(INSPECTION.DATE))+geom_histogram(binwidth = 30)+labs(x="Inspection Date",y="number of inspection",title="Number of Inspection over Time")
Also, notice that we want to create a map in our shiny application, thus we will need data of longitute and latitude of each restaurant which is not provided in this data set. Thus we wrote a python file to get such information through google map. The Python code is listed below.
#!/usr/bin/python
import csv
import requests
import time
import json
appid = "35348ee9"
appkey = "874bc0a8aaafe29bbe84abaeb78fd57d"
with open('restaurant.csv', 'rb') as readfile, open('restaurant1.csv', 'a') as writefile:
writer = csv.writer(writefile, delimiter=',')
reader = csv.reader(readfile, delimiter=',')
last_address = ""
last_coord = (0, 0)
for row in reader:
if row[0] == "CAMIS":
writer.writerow(row + ['lat', 'ing'])
continue
address = row[3] + row[4] + row[2]
if address == last_address:
writer.writerow(row + [last_coord[0], last_coord[1]])
continue
last_address = address
url = "https://api.cityofnewyork.us/geoclient/v1/address.json?houseNumber={0}&street={1}&borough={2}&app_id={3}&app_key={4}".format(row[3], row[4], row[2], appid, appkey)
while True:
try:
r = requests.get(url).json()
lat = r['address']['latitude']
lon = r['address']['longitude']
writer.writerow(row + [lat, lon])
print (lat, lon)
break
except Exception as e:
print e
print address
writer.writerow(row + ['0.0', '0.0'])
break
#https://api.cityofnewyork.us/geoclient/v1/address.json?houseNumber=30&street=ROCKFEELLER PLAZA&borough=Manhattan&app_id=35348ee9&app_key=874bc0a8aaafe29bbe84abaeb78fd57d